16 research outputs found

    Pengembangan tata bahasa baku bahasa Indonesia (TBBI) daring terpadu

    The Agency for Language Development and Cultivation (Badan Pengembangan dan Pembinaan Bahasa, or Badan Bahasa), under the Ministry of Education and Culture of the Republic of Indonesia, is the government agency tasked with handling language and literature matters in Indonesia, and it publishes a range of language products. Two products frequently used by learners of Indonesian are the Kamus Besar Bahasa Indonesia (KBBI) and the Tata Bahasa Baku Bahasa Indonesia (TBBI). The latest, fifth edition of the KBBI (Amalia 2016) was launched in 2016 in three versions: print, online, and offline (Moeljadi et al. 2017). Since its launch on 28 October 2016, the online KBBI (KBBI Daring) has been warmly received by the public, both in Indonesia and abroad. KBBI Daring makes it easier for learners of Indonesian and the general public to use the dictionary in the digital era. The same can be done for the TBBI. This paper discusses the initial stage of developing the database and website of the integrated online TBBI (TBBI Daring Terpadu) using INDRA (Indonesian Resource Grammar) (Moeljadi et al. 2015), a computational grammar of Indonesian developed through grammar engineering with reference to standard Indonesian reference grammars, chiefly the TBBI (Alwi et al. 2014) and Indonesian Reference Grammar (Sneddon et al. 2010). TBBI Daring Terpadu will contain the grammar rules of standard Indonesian, combined with a lexicon and examples from a corpus of standard Indonesian that has been annotated syntactically and semantically. The hope is that TBBI Daring Terpadu can become the main reference for standard Indonesian grammar, easily accessible to its users, for example learners of Indonesian as a foreign language (Bahasa Indonesia bagi Penutur Asing, BIPA); that it can enrich KBBI Daring with a more specific classification of word classes; and that it can advance computational linguistics and natural language processing for Indonesian, for example in machine translation and in the development of grammar and lexicon checkers for standard Indonesian.

    Linguistic studies using large annotated corpora: Introduction


    Estimating headedness in Indonesian, French, and Finnish


    Building Cendana: a Treebank for Informal Indonesian


    NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

    Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the NusaWrites benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our empirical results using existing multilingual large language models show the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes.

    NusaCrowd: Open Source Initiative for Indonesian NLP Resources

    We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.

    Possessive Verbal Predicate Constructions in Indonesian

    This paper deals with verbal predicate constructions used to express 'possession' in Indonesian (both 'formal Indonesian' and 'Colloquial Jakartan Indonesian'), examining the conditions, regularities, and tendencies that govern their use from morphological, syntactic, and semantic perspectives. In Moeljadi (2010), I stated that there are eight possessive verbal predicate constructions in Indonesian, i.e. X memiliki Y, X mempunyai Y, X punya Y, X ada Y, X ada Y=nya, X ber-Y, X ber-Y-kan Z, and X Y-an (X represents the 'possessor', Y represents the 'possessee' or 'possessum', and Z represents a complement). That analysis of how Indonesian encodes the single concept of 'possession' in more than one construction was based mainly on my intuition as a native speaker of Indonesian, and it concluded that 'register' and the notion of '(in)alienability' play important roles in the encoding process. This time, I conducted interviews in 2010 and 2011 in order to make a more objective analysis. The data from those interviews were then analyzed using cluster analysis. I conclude that (i) only five constructions, i.e. X memiliki Y, X mempunyai Y, X punya Y, X ada Y, and X ber-Y, can be regarded as encoding the meaning of 'possession', (ii) one construction, i.e. X ber-Y, has a special characteristic compared with the other four and takes a different kind of possessee, and (iii) whether the possessor is singular or plural, or a first, second, or third person pronoun, the acceptability of the constructions does not change.

    An Indonesian resource grammar (INDRA) : and its application to a treebank (JATI)

    This dissertation describes the creation and development of an open-source, broad-coverage Indonesian computational grammar, called the Indonesian Resource Grammar (INDRA), within the framework of Head-Driven Phrase Structure Grammar (HPSG) (Pollard & Sag, 1994; Sag et al., 2003) and Minimal Recursion Semantics (MRS) (Copestake et al., 2005), using computational tools and resources developed by the DEep Linguistic Processing with HPSG-INitiative (DELPH-IN) research consortium. As a resource grammar, INDRA was employed to build an open-source treebank, called JATI. The research on INDRA and its application to JATI was conducted over four years, from January 2014 to January 2018, during my PhD candidature. Previous work on the computational grammar of Indonesian has mainly been done in the framework of Lexical-Functional Grammar (LFG) (Kaplan & Bresnan, 1982; Dalrymple, 2001), such as Arka (2010a) and Mistica (2013). A computational grammar of Indonesian called IndoGram (Arka, 2012) was developed within the LFG-based Parallel Grammar (ParGram) framework, using the Xerox Linguistic Environment (XLE) parser. To the best of my knowledge, no work on Indonesian HPSG has been done. Thus, the development of INDRA can also function as an investigation of the cross-linguistic potential of HPSG and MRS. The approach taken is corpus-driven. The scope is the analysis and computational implementation of some basic Indonesian constructions and some phenomena in Indonesian text from two sources: the Nanyang Technological University Multilingual Corpus (NTU-MC) (Tan & Bond, 2012) and definition sentences in the fifth edition of Kamus Besar Bahasa Indonesia (KBBI) (Amalia, 2016); the latter contains 2,003 sentences and was treebanked under the name JATI. The lexicon was semi-automatically acquired from various sources: the English Resource Grammar (ERG) (Copestake & Flickinger, 2000) via Wordnet Bahasa (Nurril Hirfana Mohamed Noor et al., 2011; Bond et al., 2014), the NTU-MC, and the KBBI definition sentence corpus. The coverage, i.e. the quality and quantity of the sentences in the corpus that the grammar parses, is evaluated using test suites. INDRA can parse and generate complex noun phrases with clitics, determiners, numerals, classifiers, and defining relative clauses; verb phrases with auxiliaries and voice markers; major copula constructions; compounds; coordination of words and phrases with the same part of speech; and subordination. However, at the time of submission, INDRA still cannot handle phenomena such as equative, comparative, and superlative adjective phrases; coordination of words and phrases of different parts of speech; possessor topic-comment relative clauses with more than one comment; imperatives; and constructions with wh-question words. These are left for future work. Despite its limitations, compared with IndoGram, INDRA has more precision in the analyses of some phenomena and has fifteen times more sentences in its open-source treebank. In addition, INDRA has the potential to be used in various applications such as multilingual machine translation and computer-assisted language learning. Since INDRA is developed in the DELPH-IN community along with other grammars such as the English Resource Grammar (ERG) (Flickinger et al., 2010) using the same semantics (MRS), a semantic-transfer-based machine translation system can be easily built. In summary, INDRA serves as the first open-source computational grammar for Indonesian that covers most of the important constructions. INDRA has reached a stage at which it can potentially be applied to tasks such as treebanking, machine translation, and computer-assisted language learning. Doctor of Philosophy

    Usage of Indonesian Possessive Verbal Predicates : A Statistical Analysis Based on Storytelling Survey


    Possessive Verbal Predicate Constructions in Indonesian
